CS432 - Fall 2001
Assignment 5 - B+-Tree
Deadline: November 20, 23:59pm

Check this page and the newsgroup frequently for updates. The extra credit part of A3 will be graded in a week.

FAQ

·         Question: It seems that clock() returns only the processor time, not the overall time.
Answer: Use time() instead to get accurate timings.

·         Question: In reporting the statistics, ie. running time and number of page misses, do we have to vary the number of tuples in each relation besides varying the buffer pool size?
Answer: Use a large number of tuples to get a stable running time and repeat your experiments a few times as stated below in under "statistics collection". You do not have to vary the number of tuples.

·         Question: When we do the join, should we include both of the join attributes or just one copy?
Answer: Use the function MakeNewRecord() to make the new record.

Introduction
For this assignment you are expected to implement three join algorithms and evaluate their relative performance.

Background
Let us quickly review some of the modules you will need for this assignment. You should already be familiar with them from the the previous assignments.

In Minibase, a relation is implemented as a heap file, which is a collection of records. Records can be inserted or deleted from a heap file, and each record is uniquely identified by a record id. A scan is the interface used to access records in a heap file, one by one.

An index provides fast access to records in the heap file, and currently Minibase only supports B+-indices. Each entry in the index is a (key, record id) pair. Entries can be inserted or deleted from an index. An index search scan provides interface for accessing the records in the index.

Four classes are provided for you : (i.e. you only need to call methods in these classses)

Refer to the reference section for the details of these classes and how to use them.

Statistics collection
Augment the buffer manager class given to you to collect statistics. Specifically, you should be able to tell how many PinPage requests have been made and how many of those miss. It is suggested that you write two functions to reset and report the statistics.

You can use the time() function (which returns the current time) to determine out the running time of your join algorithm. You will need to include time.h. Refer to the C++ help for more details on time().

You should let your algorithm run for a few times (the longer the better) and report the average running time. Avoid running other CPU or I/O intensive processes (Word, Netscape, etc.) while collecting statistics since clock() reports actual time rather than CPU time. Print out average running times and the number of misses for every join algorithm you implement.

Tuple-at-a-time nested loop join.
Start with this one, it is the simplest to implement.

Block nested loop join
The algorithm for this join method can be found in your textbook. Since Minibase does not provide page-by-page access into a relation stored in a heap file, you will simulate a "block" with an array storing the records. The join function will take as a parameter, an integer B (block size = array size) of the outer relation. You are not required to implement the double buffering techniques or use hash tables.

The pseudocode for this join is:

For each block B in R
  For each tuple s in S
    For each tuple r in B
      Match r with s
      if Match then
        Insert (r,s) into the result relation

Compare the performance of "Block nested loop" join for various block sizes.

Index nested loop join
Implement this algorithm based on the pseudo-code in the textbook.

First create an unclustered B+-tree index on the inner relation. Determine whether it is beneficial to build a B+-tree index for the purpose of performing a single join operation by comparing the cost of this join with that of other join methods.

Performance comparison

Simplifications

Source code
The source code for the project is located in the directory \\goose\courses\cs432-fall01\a5\code. The directory contains:

You should write your code in <join-method>.cpp and main.cpp. Investigate the code provided to understand how the test program works. Functions in join.cpp and relation.cpp will be useful for writing your join methods and for debugging. The function SortFile() is particularly useful as an example on how to use HeapFile, Scan, BTreeFile and BTreeFileScan. To start working on the project, copy the directory to your workspace, and double-click on a6.dsw.

Hints

Reference
Those in bold are classes and types that are particularly useful for this assignment. You don't need to know the rest to complete the assignment, but if you wish to study the code provided, they may be of help.

Coding convention
You need to follow certain coding convention for all your assignments:

Submission procedure

How to hand in: You are a group of two students. Drop only the following five files into your handin directory on goose (\\goose\courses\cs432-fall01\handinA5).

So if students with netids "abc1" and "xyz2" form a group together, one of the two handin directories (for example the directory \\goose\courses\cs432-fal01\handinA5\abc1) will contain the five files for the group. Make sure that your program can be recompiled as needed. You can submit the executable file, but we probably won't look at it. Delete the empty handin directory of the other student.

Keep a copy of the project in your own account just in case.

Grading criteria
Your assignments will be graded according to the following criteria: